Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add C3ID implementation #21

Merged
merged 12 commits into from
Jan 15, 2025
Merged

Add C3ID implementation #21

merged 12 commits into from
Jan 15, 2025

Conversation

nguerrera
Copy link
Contributor

@nguerrera nguerrera commented Jan 8, 2025

This implements C3ID computation that we will need soon.

The implementation differs from the one in https://github.com/microsoft/security-utilities and this seeks to become the new standard.

Changes:

  • There is no use of "Microsoft"
  • There is no byte array to hex conversion in the intermediate computation
  • Optimized to avoid allocations
  • Raw value is truncated to 12 bytes instead of 15
  • "C3ID" is prefixed to the canonical representation in base64

Tests include a naive, unoptimized implementation that is easier to understand, and we check that both implementations match. We also check that we produce the same deterministic value for a sampling of hard-coded test input.

This change also fixes edge cases with empty span input in encoding polyfills.

This ports C3ID computation (that we will need soon) from https://github.com/microsoft/security-utilities

Changes:
- Company name is a parameter and not hard-coded to Microsoft
- Optimized like the rest of Cask to avoid allocations and write the result to a span

Tests added here were also run against the implementation in security-utilities to check equivalence.

Also: fix edge cases with empty span input in encoding polyfills.
@nguerrera
Copy link
Contributor Author

#22 should fix the build issue. Let's get that reviewed and merged first.

@nguerrera nguerrera closed this Jan 9, 2025
@nguerrera nguerrera reopened this Jan 9, 2025
[InlineData("😁", "f/BTV0j6A8km4KDw7aJz")]
public void Test_Basic(string text, string expected)
{
string actual = ComputeC3IDBase64(company: "Microsoft", text);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey is the proper rendering C3ID or C3id?

Haha. It is a true fact that I treat this as a valid question despite the obvious triviality of the concern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer C3id but need I mention I have no desire to die on this hill? :)

Copy link
Member

@rwoll rwoll Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tehe. I think I prefer C3Id or C3ID over C3id. Id is a word so if the convention in C# is CreateUrl, C3Id would be the most consistent.

Copy link
Contributor Author

@nguerrera nguerrera Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not appear in any identifier other than in tests. As such, my care factor is vanishingly small. I'm just as picky about these things in public API, so no judgment. 😁 I'll happily change it, but I still don't know which to use...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with C3Id now. I need it in non-test code in my next change. Still not public, but I think it's worth settling this right now and not having noise of renaming it mixed in with other changes.

My reason for C3Id over C3id is a nod to my much younger self that rewrote the FxCop casing tokenizer ~20 years ago. That would parse this as two tokens C3 and id. And only C3Id respects the guidelines with that tokenization. 😁

I'd ask to get rid of the number in the middle of an acronym, but I think I've caused enough trouble asking for algo changes.

- Use base64-decoded prefix for hash input and base64-encode input
- Throw on empty or whitespace
Prefix.CopyTo(input);
SHA256.HashData(textUtf8, input[Prefix.Length..]);

// Perform second hash, truncate, and copy to destination.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting concerned about the cost here. Two rounds of SHA256 is a lot of compute for a 12 byte hash.

For context from offline conversation: the rationale (as I understand it, but I may not) for two rounds in C3ID is to make it possible to compute the C3ID from the plain SHA256. Without sharing the key and without adding C3ID to their toolkit, someone can send you the SHA256 of a key, and then you can convert that to C3ID.

We should maybe reconsider if this scenario is worth the cost added to CASK scenarios. I'll provide some data on that cost in a minute.

Copy link
Contributor Author

@nguerrera nguerrera Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without C3ID in hash scenarios

From #18 (comment)

Method Mean Error StdDev Gen0 Allocated
CompareHash_Cask 451.609 ns 3.9292 ns 3.6753 ns - -
CompareHash_Floor 294.875 ns 2.7652 ns 2.5866 ns - -
GenerateHash_Cask 300.978 ns 1.3669 ns 1.2786 ns 0.0091 176 B
GenerateHash_Floor 225.569 ns 1.6476 ns 1.2864 ns 0.0057 112 B

With C3ID in hash scenarios

From my work-in-progress branch:

Method Mean Error StdDev Gen0 Allocated
CompareHash_Cask 856.6 ns 6.06 ns 5.37 ns - -
CompareHash_Floor 298.5 ns 3.69 ns 3.27 ns - -
GenerateHash_Cask 725.1 ns 5.20 ns 4.86 ns 0.0114 216 B
GenerateHash_Floor 226.3 ns 0.78 ns 0.69 ns 0.0057 112 B

There are some other changes in this branch so this doesn't quite isolate the C3ID cost. However, I also measured on this machine that a single SHA256 of Cask key sized data -- without allocation, copying, or encoding conversion -- takes about 120ns on this machine. That's a lot relative to the "floor" scenarios we're aiming to replace.

Up to now, I was confident in being able to tell a good story about how the extra Cask costs were a fair price to pay for Cask features, but I am not sure C3ID is living up to that now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With that said, I think it would be good to take this change and discuss options to improve this in follow-up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack on this and let's talk more.

Copy link
Member

@michaelcfanning michaelcfanning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@michaelcfanning michaelcfanning merged commit 1ec3741 into microsoft:main Jan 15, 2025
8 checks passed
@nguerrera nguerrera deleted the c3id branch January 15, 2025 22:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants